Enabling Approximate Joint Sampling in Diffusion LMs

Bansal, Parikshit, Sanghavi, Sujay

arXiv.org Artificial Intelligence

In autoregressive language models, each token is sampled conditioned on all past tokens; the overall string is therefore sampled from the correct joint distribution represented by the model. In contrast, masked diffusion language models generate text by unmasking tokens out of order and potentially in parallel. Generating a string sampled from the correct joint distribution would again require exactly one token unmasking per full-model forward pass. The more tokens unmasked in parallel, the further the string drifts from the true joint; this shows up as a drop in accuracy (but an increase in speed). In this paper we devise a way to approximately sample multiple tokens from the joint distribution in a single full-model forward pass, by developing a new lightweight single-layer "sampler" on top of an existing large diffusion LM. One forward pass of the full model can now be followed by multiple forward passes of only this sampler layer, yielding multiple unmasked tokens. The sampler is trained to mimic exact joint sampling from the (frozen) full model. We show the effectiveness of our approximate joint sampling for both pretrained-only (Dream-7B-Base) and instruction-tuned (Dream-7B-Instruct) models on language modeling and math & coding tasks. When four tokens are unmasked per full-model denoising step, our sampling algorithm achieves a MAUVE score of 0.87 (vs. a marginal baseline of 0.31) with respect to the true joint distribution. Masked diffusion language models (Sahoo et al., 2024; Austin et al., 2021; Lou et al., 2023) generate text strings by starting from an all-masked sequence of tokens and iteratively replacing the masked tokens with tokens from the vocabulary, with each "denoising" forward pass unmasking one or a few tokens.
As opposed to autoregressive models, which generate tokens left to right, one per forward pass, masked diffusion models can unmask tokens in any order and potentially several in parallel. The more tokens unmasked in parallel after a single denoising forward pass, the faster and cheaper the overall generation (Sahoo et al., 2024).
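The idea of one expensive full-model pass followed by several cheap sampler-layer passes can be sketched as follows. This is a toy illustration, not the paper's method: `full_model_logits` and `sampler_layer_logits` are hypothetical stand-ins for the frozen diffusion LM and the lightweight sampler, and the unmasking order is simplified to left-to-right.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK = 8, -1

def full_model_logits(seq):
    # Stand-in for an expensive full-model denoising pass: per-position
    # logits conditioned on the current (partially masked) sequence.
    local = np.random.default_rng(hash(tuple(seq)) % (2**32))
    return local.standard_normal((len(seq), VOCAB))

def sampler_layer_logits(seq, cached):
    # Stand-in for the lightweight single-layer sampler: a cheap refresh
    # of the cached full-model logits after each newly committed token.
    return cached + 0.1 * full_model_logits(seq)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def unmask_k_tokens(seq, k):
    """One full-model pass, then k cheap sampler passes: each committed
    token conditions the next draw, approximating joint sampling."""
    seq = list(seq)
    cached = full_model_logits(seq)        # the single expensive pass
    for _ in range(k):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        logits = sampler_layer_logits(seq, cached)  # cheap pass
        i = masked[0]                      # simplified: left-to-right order
        seq[i] = int(rng.choice(VOCAB, p=softmax(logits[i])))
    return seq

out = unmask_k_tokens([MASK] * 6, k=4)
print(out)
```

Because each sampler pass sees the tokens committed by the previous passes, the k draws are chained rather than independent, which is what distinguishes this from sampling k tokens from the marginals of a single pass.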


Boosting Embodied AI Agents through Perception-Generation Disaggregation and Asynchronous Pipeline Execution

Zhang, Shulai, Xu, Ao, Chen, Quan, Zhao, Han, Cui, Weihao, Zheng, Ningxin, Lin, Haibin, Liu, Xin, Guo, Minyi

arXiv.org Artificial Intelligence

Embodied AI systems operate in dynamic environments, requiring seamless integration of perception and generation modules to meet high-frequency input and output demands. Traditional sequential computation patterns, while effective in ensuring accuracy, fall well short of the "thinking" frequency needed for real-world applications. In this work, we present Auras, an algorithm-system co-designed inference framework that optimizes the inference frequency of embodied AI agents. Auras disaggregates perception from generation and applies controlled pipeline parallelism between them to achieve high and stable throughput. To address the data staleness that arises as parallelism increases, Auras establishes a public context shared by perception and generation, thereby preserving the accuracy of embodied agents. Experimental results show that Auras improves throughput by 2.54x on average while achieving 102.7% of the original accuracy, demonstrating its efficacy in overcoming the constraints of sequential computation and providing high throughput.
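The disaggregation-plus-shared-context pattern can be sketched with two threads and a bounded queue. This is a minimal illustration under stated assumptions, not the Auras implementation: `perceive`, `generate`, and the `ctx` dictionary are hypothetical stand-ins for the perception stage, the generation stage, and the public context.

```python
import queue
import threading

def perceive(frame):
    # Stand-in perception module: turns a raw frame into features.
    return f"feat({frame})"

def generate(features, ctx):
    # Stand-in generation module: reads the latest shared context so a
    # pipelined (possibly stale) feature is paired with fresh observations.
    return f"act[{features}|ctx={ctx['latest']}]"

def run_pipeline(frames):
    ctx = {"latest": None}         # public context shared by both stages
    q = queue.Queue(maxsize=2)     # bounded queue = controlled parallelism
    actions = []

    def perception_stage():
        for f in frames:
            feats = perceive(f)
            ctx["latest"] = f      # publish the freshest observation
            q.put(feats)           # blocks when the pipeline is full
        q.put(None)                # sentinel: no more frames

    def generation_stage():
        while (feats := q.get()) is not None:
            actions.append(generate(feats, ctx))

    t1 = threading.Thread(target=perception_stage)
    t2 = threading.Thread(target=generation_stage)
    t1.start(); t2.start(); t1.join(); t2.join()
    return actions

acts = run_pipeline(range(4))
print(acts)
```

The bounded queue caps how far perception may run ahead of generation, which is one simple way to keep staleness under control while still overlapping the two stages.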


Firstly, we thank all reviewers for the helpful comments and suggestions

Neural Information Processing Systems

Firstly, we thank all reviewers for the helpful comments and suggestions. We will add citations in Table 4. We have not conducted experiments on language modeling or image density estimation. Admittedly, modeling the intra-step correlation would require extra computation time; we will add this discussion in the revised version. We are not entirely sure about the motivation of the multi-frame setting.


Theoretical analysis of deep neural networks for temporally dependent observations

Neural Information Processing Systems

Despite the widespread use of neural networks in such settings, most theoretical developments of deep neural networks are under the assumption of independent observations, and theoretical results for temporally dependent observations are scarce.


EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing

Sioros, Vassilis, Potamianos, Alexandros, Paraskevopoulos, Giorgos

arXiv.org Artificial Intelligence

In this study, we investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross and self-attention mechanisms. Integrating a diffusion-based strategy, influenced by Auffusion, we extend the model's functionality to support refinement edits, establishing a baseline for prompt-guided audio editing. Additionally, we introduce an alternative approach by incorporating MUSICGEN, a pre-trained frozen auto-regressive model, and propose three editing mechanisms based on Replacement, Reweighting, and Refinement of the attention scores. We employ commonly used music-specific evaluation metrics and a human study to gauge time-varying controllability, adherence to global text cues, and overall audio realism. The automatic and human evaluations indicate that the proposed combination of prompt-to-prompt guidance with autoregressive generation models significantly outperforms the diffusion-based baseline in terms of melody, dynamics, and tempo of the generated audio. Our code is available at https://github.com/billsioros/EditGen
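The Reweighting mechanism can be illustrated with a single cross-attention step: scale the attention mass assigned to selected prompt tokens, then renormalize. This is a hedged simplification, not the EditGen code; the shapes, the `weights` vector, and the renormalization step are assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v, weights=None):
    """Cross-attention with optional per-token reweighting of the
    attention scores, in the spirit of Prompt-to-Prompt editing."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores)                 # (n_audio, n_text)
    if weights is not None:
        attn = attn * weights              # amplify/attenuate prompt tokens
        attn = attn / attn.sum(-1, keepdims=True)  # renormalize rows
    return attn @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 16))           # audio-side queries
k = rng.standard_normal((3, 16))           # text-prompt keys
v = rng.standard_normal((3, 16))           # text-prompt values

base = cross_attention(q, k, v)
w = np.array([1.0, 3.0, 1.0])              # boost influence of prompt token 1
edited = cross_attention(q, k, v, weights=w)
print(np.abs(edited - base).mean())
```

Replacement and Refinement would instead swap or partially overwrite the attention maps between the original and edited prompts, but the same entry point (`weights`/map manipulation inside the attention call) applies.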


CarPlanner: Consistent Auto-regressive Trajectory Planning for Large-scale Reinforcement Learning in Autonomous Driving

Zhang, Dongkun, Liang, Jiaming, Guo, Ke, Lu, Sha, Wang, Qi, Xiong, Rong, Miao, Zhenwei, Wang, Yue

arXiv.org Artificial Intelligence

Trajectory planning is vital for autonomous driving, ensuring safe and efficient navigation in complex environments. While recent learning-based methods, particularly reinforcement learning (RL), have shown promise in specific scenarios, RL planners struggle with training inefficiency and with managing large-scale, real-world driving scenarios. In this paper, we introduce CarPlanner, a Consistent auto-regressive Planner that uses RL to generate multi-modal trajectories. The auto-regressive structure enables efficient large-scale RL training, while consistency stabilizes policy learning by maintaining temporal coherence across time steps. Moreover, CarPlanner employs a generation-selection framework with an expert-guided reward function and an invariant-view module, simplifying RL training and enhancing policy performance. Extensive analysis demonstrates that our proposed RL framework effectively addresses the challenges of training efficiency and performance, positioning CarPlanner as a promising solution for trajectory planning in autonomous driving. To the best of our knowledge, we are the first to demonstrate that an RL-based planner can surpass both IL- and rule-based state-of-the-art (SOTA) methods on the challenging large-scale real-world dataset nuPlan. Our proposed CarPlanner surpasses RL-, IL-, and rule-based SOTA approaches on this demanding dataset.
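The generation-selection framework can be sketched in miniature: roll out one trajectory auto-regressively per discrete mode, score each with a reward, and keep the best. Everything here is a hypothetical stand-in, not CarPlanner's policy or reward: `step_policy` fakes the auto-regressive policy, `mode` fakes the multi-modal conditioning, and `reward` fakes the expert-guided scoring.

```python
import numpy as np

rng = np.random.default_rng(0)

def step_policy(state, mode):
    # Stand-in auto-regressive policy: proposes the next 2-D waypoint
    # given the current state and a discrete mode (e.g. lane choice).
    return state + np.array([1.0, 0.2 * mode]) + 0.05 * rng.standard_normal(2)

def rollout(mode, horizon=5):
    # Generate one trajectory auto-regressively, one waypoint at a time,
    # each step conditioned on the previously committed state.
    state, traj = np.zeros(2), []
    for _ in range(horizon):
        state = step_policy(state, mode)
        traj.append(state)
    return np.array(traj)

def reward(traj, target_y=0.4):
    # Stand-in expert-guided reward: prefer trajectories near a target lane.
    return -np.mean((traj[:, 1] - target_y) ** 2)

# Generation-selection: one candidate trajectory per mode, keep the best.
candidates = {mode: rollout(mode) for mode in (-1, 0, 1, 2)}
best_mode = max(candidates, key=lambda m: reward(candidates[m]))
print(best_mode, reward(candidates[best_mode]))
```

Separating generation (per-mode rollouts) from selection (reward-based ranking) is what lets a multi-modal planner commit to a single trajectory at execution time.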